population loss
Generalization of Model-Agnostic Meta-Learning Algorithms: Recurring and Unseen Tasks
In this paper, we study the generalization properties of Model-Agnostic Meta-Learning (MAML) algorithms for supervised learning problems. We focus on the setting in which we train the MAML model over m tasks, each with n data points, and characterize its generalization error from two points of view: First, we assume the new task at test time is one of the training tasks, and we show that, for strongly convex objective functions, the expected excess population loss is bounded by O(1/mn). Second, we consider the MAML algorithm's generalization to an unseen task and show that the resulting generalization error depends on the total variation distance between the underlying distributions of the new task and the tasks observed during the training process. Our proof techniques rely on the connections between algorithmic stability and generalization bounds of algorithms. In particular, we propose a new definition of stability for meta-learning algorithms, which allows us to capture the role of both the number of tasks m and the number of samples per task n on the generalization error of MAML.
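As a rough illustration of the quantity the unseen-task result depends on, the sketch below computes the total variation distance between two discrete distributions using the standard formula TV(p, q) = (1/2) * sum_i |p_i - q_i|. The function name `total_variation` and the toy distributions are assumptions made for this example; the abstract does not specify how the distance would be evaluated in practice.

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions.

    p and q are sequences of probabilities over the same finite support.
    Hypothetical helper for illustration only; not taken from the paper.
    """
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same support")
    # TV(p, q) = 0.5 * sum of absolute differences of the probabilities
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


if __name__ == "__main__":
    train_task = [0.5, 0.5]   # toy distribution of a training task
    new_task = [0.9, 0.1]     # toy distribution of an unseen task
    print(total_variation(train_task, new_task))
```

The distance is 0 when the two distributions coincide and 1 when their supports are disjoint, which matches the intuition that a new task resembling the training tasks should generalize better.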
A theory of learning data statistics in diffusion models, from easy to hard
Bardone, Lorenzo, Merger, Claudia, Goldt, Sebastian
While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images exhibit a distributional simplicity bias, learning simple, pair-wise input statistics before specializing to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, where we precisely control both pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent, in analogy to related invariants in different learning paradigms. Using this invariant, we prove that the denoiser learns simple, pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant is linear if pair-wise and higher-order statistics share a correlated latent structure. Our work describes a key mechanism for how diffusion models can learn distributions of increasing complexity.
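To make the distinction between pair-wise and higher-order statistics concrete, the following sketch computes plug-in estimates of the second cumulant (variance) and the fourth cumulant of a scalar sample, using the textbook identity kappa_4 = E[(x - mu)^4] - 3 * (E[(x - mu)^2])^2. This is only a one-dimensional illustration under our own assumptions: the function name `cumulants_2_and_4` is hypothetical, and the mixed cumulant model in the abstract concerns high-dimensional inputs that are not reproduced here.

```python
def cumulants_2_and_4(samples):
    """Plug-in (biased) estimates of the 2nd and 4th cumulants of a scalar sample.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    m2 = sum(c ** 2 for c in centered) / n   # second central moment
    m4 = sum(c ** 4 for c in centered) / n   # fourth central moment
    kappa2 = m2                              # second cumulant = variance
    kappa4 = m4 - 3.0 * m2 ** 2              # fourth cumulant (zero for a Gaussian)
    return kappa2, kappa4


if __name__ == "__main__":
    # A balanced +/-1 sample: unit variance, clearly non-Gaussian fourth cumulant.
    print(cumulants_2_and_4([-1.0, 1.0, -1.0, 1.0]))
```

A Gaussian variable has a vanishing fourth cumulant, so a nonzero value signals structure that pair-wise statistics alone cannot capture.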
Response to reviewers for the paper: "On Lazy Training in Differentiable Programming"
We thank the reviewers for their comments and suggestions. Hereafter, we list reviewers' (sometimes paraphrased) Each answer will translate into a clarification in the final version. Reviewer #2 and #3 felt that our message was lacking clarity. A.2). We will add more pointers to their statistical analysis, from the existing literature (e.g. L81-90 in the main paper, often α(m) = 1/ m in these works).